String Interning
   HOME

TheInfoList



OR:

In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool. The single copy of each string is called its ''intern'' and is typically looked up by a method of the string class, for example String.intern() in
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
. All compile-time constant strings in Java are automatically interned using this method. String interning is supported by some modern
object-oriented Object-oriented programming (OOP) is a programming paradigm based on the concept of "objects", which can contain data and code. The data is in the form of fields (often known as attributes or ''properties''), and the code is in the form of pro ...
programming language A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language. The description of a programming ...
s, including Java,
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
,
PHP PHP is a general-purpose scripting language geared toward web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by The PHP Group ...
(since 5.4),
Lua Lua or LUA may refer to: Science and technology * Lua (programming language) * Latvia University of Agriculture * Last universal ancestor, in evolution Ethnicity and language * Lua people, of Laos * Lawa people, of Thailand sometimes referred t ...
and
.NET languages CLI languages are computer programming languages that are used to produce libraries and programs that conform to the Common Language Infrastructure (CLI) specifications. With some notable exceptions, most CLI languages compile entirely to the Com ...
.
Lisp A lisp is a speech impairment in which a person misarticulates sibilants (, , , , , , , ). These misarticulations often result in unclear speech. Types * A frontal lisp occurs when the tongue is placed anterior to the target. Interdental lisping ...
,
Scheme A scheme is a systematic plan for the implementation of a certain idea. Scheme or schemer may refer to: Arts and entertainment * ''The Scheme'' (TV series), a BBC Scotland documentary series * The Scheme (band), an English pop band * ''The Schem ...
,
Julia Julia is usually a feminine given name. It is a Latinate feminine form of the name Julio and Julius. (For further details on etymology, see the Wiktionary entry "Julius".) The given name ''Julia'' had been in use throughout Late Antiquity (e.g ...
,
Ruby A ruby is a pinkish red to blood-red colored gemstone, a variety of the mineral corundum ( aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called sa ...
and
Smalltalk Smalltalk is an object-oriented, dynamically typed reflective programming language. It was designed and created in part for educational use, specifically for constructionist learning, at the Learning Research Group (LRG) of Xerox PARC by Alan Ka ...
are among the languages with a
symbol A symbol is a mark, sign, or word that indicates, signifies, or is understood as representing an idea, object, or relationship. Symbols allow people to go beyond what is known or seen by creating linkages between otherwise very different conc ...
type that are basically interned strings. The library of the
Standard ML of New Jersey Standard ML of New Jersey (SML/NJ; Standard Meta-Language of New Jersey) is a free and open-source compiler and programming environment for the Standard ML programming language. Aside from its runtime system, which is written in C, SML/NJ is wr ...
contains an atom type that does the same thing.
Objective-C Objective-C is a general-purpose, object-oriented programming language that adds Smalltalk-style messaging to the C programming language. Originally developed by Brad Cox and Tom Love in the early 1980s, it was selected by NeXT for its NeXTS ...
's selectors, which are mainly used as method names, are interned strings. Objects other than strings can be interned. For example, in Java, when primitive values are boxed into a wrapper object, certain values (any boolean, any byte, any char from 0 to 127, and any short or int between −128 and 127) are interned, and any two boxing conversions of one of these values are guaranteed to result in the same object.


History

Lisp A lisp is a speech impairment in which a person misarticulates sibilants (, , , , , , , ). These misarticulations often result in unclear speech. Types * A frontal lisp occurs when the tongue is placed anterior to the target. Interdental lisping ...
introduced the notion of interned strings for its
symbols A symbol is a mark, sign, or word that indicates, signifies, or is understood as representing an idea, object, or relationship. Symbols allow people to go beyond what is known or seen by creating linkages between otherwise very different conc ...
. Historically, the data structure used as a string intern pool was called an ''oblist'' (when it was implemented as a linked list) or an ''obarray'' (when it was implemented as an array). Modern Lisp dialects typically distinguish symbols from strings; interning a given string returns an existing symbol or creates a new one, whose ''name'' is that string. Symbols often have additional properties that strings do not (such as storage for associated values, or namespacing): the distinction is also useful to prevent accidentally comparing an interned string with a not-necessarily-interned string, which could lead to intermittent failures depending on usage patterns.


Motivation

String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as
compiler In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
s and
dynamic programming language In computer science, a dynamic programming language is a class of high-level programming languages, which at runtime execute many common programming behaviours that static programming languages perform during compilation. These behaviors co ...
runtimes) that rely heavily on
associative array In computer science, an associative array, map, symbol table, or dictionary is an abstract data type that stores a collection of (key, value) pairs, such that each possible key appears at most once in the collection. In mathematical terms an ...
s with string keys to look up the attributes and methods of an object. Without interning, comparing two distinct strings may involve examining every character of both.The string comparison can halt at the first character mismatch. For strict equality, the lengths of the strings can also be compared before traversing the string: but finding the length of
null-terminated string In computer programming, a null-terminated string is a character string stored as an array containing the characters and terminated with a null character (a character with a value of zero, called NUL in this article). Alternative names are C str ...
s does itself require traversing the string.
This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of
memory Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembered, ...
, which take time; and the reads fill up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all. String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a
network Network, networking and networked may refer to: Science and technology * Network theory, the study of graphs as a representation of relations between discrete objects * Network science, an academic field that studies complex networks Mathematics ...
or from storage. Such strings may include magic numbers or
network protocol A communication protocol is a system of rules that allows two or more entities of a communications system to transmit information via any kind of variation of a physical quantity. The protocol defines the rules, syntax, semantics and synchroniza ...
information. For example, XML parsers may intern names of tags and attributes to save memory. Network transfer of objects over Java RMI serialization object streams can transfer strings that are interned more efficiently, as the String object's handle is used in place of duplicate objects upon serialization.


Issues


Multithreading

If the interned strings are not immutable, one source of drawbacks is that string interning may be problematic when mixed with multithreading. In many systems, string interns are required to be global across all threads within an address space (or across any contexts which may share pointers), thus the intern pool(s) are global resources that should be synchronized for safe concurrent access. While this only affects string creation (where the intern pool must be checked and modified if necessary), and
double-checked locking In software engineering, double-checked locking (also known as "double-checked locking optimization") is a software design pattern used to reduce the overhead of acquiring a lock by testing the locking criterion (the "lock hint") before acquiring th ...
may be used on platforms where this is a safe optimization, the need for mutual exclusion when modifying the intern pool can be expensive. Contention can also be reduced by partitioning the string space into multiple pools, which can be synchronized independently of one another.


Reclaiming unused interned strings

Many implementations of interned strings do not attempt to reclaim (manually or otherwise) strings that are no longer used. For applications where the number of interned strings is small or fixed, or which are short-lived, the loss of system resources may be tolerable. But for long-running systems where large numbers of string interns are created at runtime, the need to reclaim unused interns may arise. This task can be handled by a
garbage collector A waste collector, also known as a garbageman, garbage collector, trashman (in the US), binman or (rarely) dustman (in the UK), is a person employed by a public or private enterprise to collect and dispose of municipal solid waste (refuse) and r ...
, though for this to work correctly
weak reference In computer programming, a weak reference is a reference that does not protect the referenced object from collection by a garbage collector, unlike a strong reference. An object referenced ''only'' by weak references – meaning "every chain of ref ...
s to string interns must be stored in the intern pool.


See also

*
Flyweight pattern In computer programming, the flyweight software design pattern refers to an object that minimizes memory usage by sharing some of its data with other similar objects. The flyweight pattern is one of twenty-three well-known '' GoF design patterns' ...


Notes


References

{{Reflist


External links


Visual J# String class.NET String ClassGuava Java Library - Interner - Non-permgen String.intern and supports other immutable types with weak and strong referenced implementations
Software optimization Interning